Red Wine Exploration

by David Vartanian

Abstract

I explore a dataset of almost 1,600 red wines in order to understand what drives the quality score assigned to each of them.

Introduction

This dataset is provided by Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis, from different universities in Portugal. It provides information such as acidity, residual sugar, chlorides, and alcohol, among others. I explore the data to find patterns and trends and to understand the meaning of the given features.

Univariate Plots Section

Let’s start by showing some summary numbers and a first set of histograms to understand the individual variables.

Quality

This document is all about quality. Here is the distribution of wine by quality.

Histograms: quality, fixed.acidity, total.sulfur.dioxide, alcohol

These histograms show how the values are distributed in the different variables.

Outliers & Statistical Info

# Statistical information about Volatile Acidity
summary(data$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

There are a few outliers only on the right side.

# Statistical information about Fixed Acidity
summary(data$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

There are several outliers only on the right side.

# Statistical information about Total Sulphur Dioxide
summary(data$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

There are many outliers only on the right side.

# Statistical information about Alcohol
summary(data$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

There are just a few outliers only on the right side.

# Statistical information about Chlorides
summary(data$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

There are just a few outliers on the left side, and many on the right side.

# Statistical information about Residual Sugar
summary(data$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

There are many outliers on the right side.

# Statistical information about pH
summary(data$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH values are fairly symmetrically distributed: the mean (3.311) is almost equal to the median (3.310). There are several outliers on both sides.

# Statistical information about Sulphates
summary(data$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

There are many outliers only on the right side.

# Statistical information about Density
summary(data$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

This variable is also fairly symmetrically distributed: the mean (0.9967) is almost equal to the median (0.9968). There are several outliers on both sides.
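The outlier remarks above can be checked numerically. The following is a minimal sketch, assuming the common 1.5 × IQR rule of thumb for flagging outliers; `count_outliers` is a hypothetical helper and the toy vector contains made-up values, not rows from the dataset.

```r
# Count values outside the 1.5 * IQR fences on each side.
# `count_outliers` is a hypothetical helper, not part of the original analysis.
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  c(left = sum(x < lower, na.rm = TRUE),
    right = sum(x > upper, na.rm = TRUE))
}

# Toy vector (made-up values) with one large value on the right.
toy <- c(0.30, 0.39, 0.45, 0.52, 0.58, 0.64, 0.70, 1.58)
count_outliers(toy)
```

As a sanity check against the reported quartiles for volatile acidity (0.39 and 0.64), the fences under this rule would be 0.015 and 1.015. The minimum (0.12) lies inside them and the maximum (1.58) lies outside, which agrees with outliers appearing only on the right side.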

Univariate Analysis

Dataset Structure

There are 9 continuous variables, 2 discrete variables and one ordered categorical variable: quality.

Main dataset interest

My general question is, how do chemical properties define the quality of the red wine?

There are interesting features in this dataset, each of them describing an important property of the red wine. Density, pH, sulphur dioxide, and sulphates are, in my opinion, the most important ones for measuring quality. Let’s see what we can find by looking at those variables.

Variable Transformations

It was not necessary to clean missing values on this dataset. However, I think it is a good idea to apply some transformations to skewed variables.

Transformed Volatile Acidity using log base 10.

Transformed Fixed Acidity (tartaric acid) using log base 10.

Transformed Total Sulphur Dioxide using log base 10.

Transformed Chlorides using log base 10.

Transformed Residual Sugar using log base 10.

Transformed Sulphates using log base 10.

Transformed Free Sulphur Dioxide using log base 10.
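To illustrate why a log base 10 transformation helps with right-skewed variables, here is a small sketch on made-up values. The helper `mean_minus_median` is hypothetical and only a rough indicator of skew: a mean well above the median suggests a long right tail.

```r
# Made-up, right-skewed values spaced evenly on a log10 scale.
x <- 10^seq(0, 2, by = 0.25)

# A rough skew indicator: how far the mean sits from the median.
mean_minus_median <- function(v) mean(v) - median(v)

mean_minus_median(x)         # clearly positive: long right tail
mean_minus_median(log10(x))  # approximately zero: symmetric after the transform

# Zeros cannot be log-transformed to a finite value.
log10(0)  # -Inf
```

The last line is worth keeping in mind: a variable containing zeros produces `-Inf` after the transformation, which is one possible source of warnings about non-finite values when plotting.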

Durability

Using a new variable, durability, it’s possible to appreciate the combined effect of sulphates and free sulphur dioxide.

This variable has only two values: S (short) and L (long), using the medians of sulphates and free sulphur dioxide as the cut-off.
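One way such a variable could be built is sketched below. The exact rule is not spelled out here, so the combination used in this sketch (L only when both variables are above their medians) is an assumption, and `add_durability` is a hypothetical helper applied to made-up rows.

```r
# Hypothetical helper: label a wine "L" (long) when both preservatives are
# above their medians, and "S" (short) otherwise. The exact rule used in the
# analysis is not shown, so this combination is an assumption.
add_durability <- function(df) {
  high_sulphates <- df$sulphates > median(df$sulphates)
  high_free_so2 <- df$free.sulfur.dioxide > median(df$free.sulfur.dioxide)
  df$durability <- factor(ifelse(high_sulphates & high_free_so2, "L", "S"),
                          levels = c("S", "L"))
  df
}

# Toy rows with made-up values, using the dataset's column names.
toy <- data.frame(
  sulphates = c(0.50, 0.55, 0.62, 0.73, 0.90),
  free.sulfur.dioxide = c(5, 20, 14, 30, 25)
)
add_durability(toy)$durability
```

Storing the result as a factor with explicit levels keeps the S/L order stable when the variable is later used to colour lines in a plot.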

Bivariate Plots Section

Let’s try to find trends and interesting patterns by comparing two variables.

# Correlation between Quality and Alcohol
cor(x = data$quality, y = data$alcohol)
## [1] 0.4761663

Fact: Higher quality wines seem to have higher levels of alcohol

# Correlation between Quality and Fixed Acidity
cor(x = data$quality, y = data$fixed.acidity)
## [1] 0.1240516

Fact: Higher quality wines seem to have slightly higher levels of fixed acidity, although the correlation is weak

# Correlation between Quality and Density
cor(x = data$quality, y = data$density)
## [1] -0.1749192

Fact: Higher quality wines seem to have lower density

Citric acid adds a fresh flavor to the wine.

# Correlation between Citric Acid and pH
cor(x = data$citric.acid, y = data$pH)
## [1] -0.5419041

Volatile acidity is the level of acetic acid in the wine. Levels that are too high give an unpleasant vinegar taste.

# Correlation between Volatile Acidity and pH
cor(x = data$volatile.acidity, y = data$pH)
## [1] 0.2349373
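Since quality is an ordered score rather than a continuous measurement, a rank-based correlation may be worth comparing against the default Pearson coefficient. The sketch below is only an illustration on made-up rows. `cor_with_quality` is a hypothetical helper, and it assumes a data frame with a `quality` column plus numeric predictors.

```r
# Hypothetical helper: correlate every other column with quality.
cor_with_quality <- function(df, method = "pearson") {
  predictors <- df[setdiff(names(df), "quality")]
  sapply(predictors, function(col) cor(col, df$quality, method = method))
}

# Toy data with made-up values; in the report this would be the wine data frame.
toy <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.8, 9.4, 10.0, 10.5, 11.2, 11.0, 12.5),
  density = c(0.9990, 0.9985, 0.9980, 0.9975, 0.9972, 0.9968, 0.9960, 0.9950)
)

# Pearson measures linear association; Spearman works on ranks, which is
# often a more natural fit for an ordinal score such as quality.
cor_with_quality(toy, method = "pearson")
cor_with_quality(toy, method = "spearman")
```

If both methods agree on the sign and rough size of a coefficient, the conclusion does not depend much on treating quality as a number.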

Bivariate Analysis

Relationships

# Correlation between Density and Quality
cor(x = data$density, y = data$quality)
## [1] -0.1749192

I’ve found a slightly negative correlation, meaning that density tends to be lower in high-quality wines. However, as the small correlation coefficient shows, density alone does not go far in determining quality.

## Warning: Removed 46 rows containing non-finite values (stat_smooth).
## Warning: Removed 46 rows containing missing values (geom_point).

# Correlation between Chlorides and Sulphates
cor(x = data$chlorides, y = data$sulphates)
## [1] 0.3712605

I’ve found that levels are mostly low for both variables. I would say that they have little influence on quality, as wines of all quality levels show similar levels of these two variables.

## Warning: Removed 48 rows containing non-finite values (stat_smooth).
## Warning: Removed 48 rows containing missing values (geom_point).

# Correlation between Chlorides and Residual Sugar
cor(x = data$chlorides, y = data$residual.sugar)
## [1] 0.05560954

I’ve found the same here: levels stay consistently low and the correlation coefficient is almost 0.

# Correlation between Free Sulphur Dioxide and Total Sulphur Dioxide
cor(x = data$free.sulfur.dioxide, y = data$total.sulfur.dioxide)
## [1] 0.6676665

Levels are mostly low. However, these two variables are clearly correlated (about 0.67), which is expected because free sulphur dioxide is part of the total.

Interesting relationships

So far I find only density to be an interesting variable to look at. The rest (chlorides, sulphates, residual sugar and sulphur dioxide) don’t seem to have a great influence on wine quality.

pH

This variable indicates the acidity level of the wine. The scale goes from 0 (very acidic) to 14 (very basic), but most red wines fall between 3 and 4.

Another point of view:

It’s quite surprising that pH levels are lower in high-quality wines.
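Because pH is a logarithmic scale, small differences in pH correspond to larger differences in acidity than they might suggest. A short sketch, using the standard definition of pH as the negative log base 10 of the hydrogen ion concentration (`h_concentration` is a hypothetical helper name):

```r
# Hydrogen ion concentration (mol/L) implied by a pH value: [H+] = 10^(-pH).
h_concentration <- function(ph) 10^(-ph)

# Ratio of acidity between the two ends of the typical red wine range.
h_concentration(3) / h_concentration(4)  # about 10

# The first and third quartiles reported earlier (3.21 and 3.40).
h_concentration(3.21) / h_concentration(3.40)  # 10^0.19, about 1.55
```

So a wine at the first quartile of pH is roughly one and a half times as acidic, in terms of hydrogen ion concentration, as a wine at the third quartile.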

Density

The density of the wine is close to that of water, and it depends on the alcohol percentage and sugar content.

Another point of view:

Density levels are also lower in high-quality wines.

Free Sulphur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of the wine.

Another point of view:

Again, the levels for this variable are lower for both low-quality and high-quality wines.

Sulphates

An additive that can contribute to sulphur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

Sulphate levels are lower for both low-quality and high-quality wines as well.

Multivariate Plots Section

A quite strong correlation can be observed between pH and density with regard to the quality of wines, meaning that it’s normal to find lower levels of pH and density in high-quality wines. The line colours show how durable the wine can be with respect to alcohol, using the durability variable introduced above. It makes sense to me that wines last longer if they contain more alcohol in addition to sulphates and free sulphur dioxide.

I’ve found here another interesting correlation, which becomes quite obvious if we pay special attention to the meaning of the variables. Density, as I said above, is the density of the wine, which sits close to that of water. Alcohol is less dense than water, so the more alcohol a wine contains, the lower its density. The coloured lines show that the durability of the wine is lower when the density is higher, which fits with durable wines containing more alcohol.

Multivariate Analysis

##     quality      mean_quality   mean_alcohol    mean_density   
##  Min.   :3.00   Min.   :3.00   Min.   : 9.90   Min.   :0.9952  
##  1st Qu.:4.25   1st Qu.:4.25   1st Qu.:10.03   1st Qu.:0.9962  
##  Median :5.50   Median :5.50   Median :10.45   Median :0.9966  
##  Mean   :5.50   Mean   :5.50   Mean   :10.72   Mean   :0.9965  
##  3rd Qu.:6.75   3rd Qu.:6.75   3rd Qu.:11.26   3rd Qu.:0.9970  
##  Max.   :8.00   Max.   :8.00   Max.   :12.09   Max.   :0.9975  
##     mean_ph      mean_citric_acid       n         
##  Min.   :3.267   Min.   :0.1710   Min.   : 10.00  
##  1st Qu.:3.294   1st Qu.:0.1915   1st Qu.: 26.75  
##  Median :3.312   Median :0.2588   Median :126.00  
##  Mean   :3.327   Mean   :0.2715   Mean   :266.50  
##  3rd Qu.:3.366   3rd Qu.:0.3498   3rd Qu.:528.25  
##  Max.   :3.398   Max.   :0.3911   Max.   :681.00
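The table above reads like a `summary()` of per-quality group means. A minimal sketch of how such a grouped table could be built in base R is shown below. This is an assumption about the approach, not the code used for the report; `summarise_by_quality` is a hypothetical helper and the toy rows are made up.

```r
# Hypothetical helper: one row per quality level with group means and counts.
summarise_by_quality <- function(df) {
  groups <- split(df, df$quality)
  out <- do.call(rbind, lapply(groups, function(g) {
    data.frame(
      quality = g$quality[1],
      mean_alcohol = mean(g$alcohol),
      mean_density = mean(g$density),
      mean_ph = mean(g$pH),
      mean_citric_acid = mean(g$citric.acid),
      n = nrow(g)
    )
  }))
  rownames(out) <- NULL
  out
}

# Toy rows with made-up values, using the dataset's column names.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 6, 7),
  alcohol = c(9.4, 9.8, 10.5, 11.0, 10.0, 12.0),
  density = c(0.9978, 0.9970, 0.9968, 0.9962, 0.9966, 0.9950),
  pH = c(3.40, 3.30, 3.32, 3.28, 3.36, 3.20),
  citric.acid = c(0.10, 0.24, 0.26, 0.30, 0.22, 0.40)
)

by_quality <- summarise_by_quality(toy)
by_quality
summary(by_quality)
```

Keeping the group size `n` next to the means is useful when reading the table, because the lowest and highest quality levels contain far fewer wines than the middle ones.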

Final Plots and Summary

Durability & Alcohol

Something remarkable to keep in mind is what this plot shows: high-quality wines seem to last longer. The orange line in the top-right corner makes a huge difference, though: they last much longer when the alcohol level is higher.

Citric Acid vs. pH

This is a pretty straightforward correlation. When the pH level gets lower (which means that the wine is more acidic), citric acid gets higher. It makes sense, doesn’t it?

Density by Quality Level

I wanted to emphasize this plot again because levels of density look similar for both low-quality and high-quality wines. Or, from another perspective, density is higher only in mid-quality wines.


Reflection

I feel that I now have a few extra tips for selecting new wines to taste: look for higher levels of alcohol and acidity, lower density, and low levels of residual sugar, chlorides, and sulphates. The high levels of alcohol and low levels of density were definitely surprising to me. However, I think that the dataset needs more categorical variables and much more data to support a better analysis.

For instance, adding columns with usual customers, sommeliers’ preferences, country of origin, types of grape, altitude of the grape crops, and type of cask used to store the wine before selling would be of great value. They would make it possible to measure wine quality not only through the product itself, but also through its background environment and production process.